INTRODUCTION


A narrow bandwidth speech processor has been developed, and a breadboard
model exists as part of the Audio and Voice Communications Laboratory.
This breadboard demonstrates that human speech can be transmitted in a
160 Hz communications channel. Also, the type of circuits used provides a
means of meeting the minimum size and weight requirements imposed by
spacecraft applications. This report presents a discussion of this development
and some possibilities the system has for spacecraft applications.

This report is divided into four sections. Section one will present 
the basic concepts of the processor. Section two will discuss some 
aspects of how the processor was developed. Section three will present 
a simplified analysis of how the processor operates. Finally, section 
four will discuss applications of the processor with particular reference 
to its development status. An appendix is provided which contains in- 
directly related information. 



I. PROCESSOR CONCEPT 


The concept of the narrow bandwidth speech processor lies within
the discovery of a technique known as the single equivalent formant
(SEF). This technique is a method by which parameters are extracted
from speech that describe an equivalence of the formant structure of
speech. However, before discussing this further, it is necessary that
the basic structure of human speech be understood. The following few
paragraphs are an attempt to provide this understanding.

Speech may be defined as humanly generated utterances, logically
formed to produce meaningful sounds, as related to written text. The
source of these utterances is, in part, the product of air from the
lungs and vocal cord control. Air from the lungs is controlled by the
vocal cord in the form of periodic bursts. The bursts of air set up
ringing effects in a number of cavities along the vocal tract. The vocal
tract is the hollow volume of the human head and neck, terminated on one
end by the vocal cord, and on the other end by the lips, teeth, mouth,
tongue, and nasal cavity, or passage.

There are times during speech production when air from the lungs is
unaffected by the vocal cord. At such times, constrictive parts of the
vocal tract (primarily lips and teeth) play an important role in speech
generation. They, along with the cavities of the vocal tract, are
instrumental in providing the logic that forms the utterances into
meaningful sounds. How this is done can be shown with the following
definitions, explanations and diagrams.

The structure of speech is known to be made up of a number of
basic parameters which associate themselves with the operations performed
on the air from the lungs by the vocal tract, for example, the periodic
bursts of air caused by the vocal cord. This operation provides a
parameter in speech structure known as pitch, which is defined as the rate
at which the bursts of air are permitted to occur by the vocal cord.

Likewise, the bursts of air cause resonances in the cavities of the
vocal tract. The resonances are called formants and are associated in
a composite manner with a term known as voiced sounds.

During the time when the vocal cord does not operate on the air from
the lungs, the constrictions of the vocal tract provide another parameter
in the form of white noise, which is termed unvoiced sound.



a. Speech pitch (air bursts)

b. Composite formant structure (voiced sound with pitch)

c. Noise or unvoiced sound

d. Composite speech signal with pitch and voiced or unvoiced sound

FIGURE 1.

If a visual perception of these parameters is required, one might
think of their appearance as shown in Figure 1 a, b, c and d.

In a frequency spectrum the composite voice signal would appear as
shown in Figure 2.


FIGURE 2. Frequency spectrum of the composite voice signal.

To this point, it may be said that the structure of speech is made
up of three basic parameters: (1) pitch, (2) formants (voiced sounds)
and (3) noise (unvoiced sounds).

(1) Pitch is the rate at which the vocal cord permits voiced sounds
to occur.

(2) Formants are resonances that occur in the cavities of the
vocal tract.

(3) Noise or unvoiced sounds are those sounds produced by the
constrictive parts of the vocal tract.

All of these parameters may be related to the phonetic structure
of a language, thus providing a correlation between written text and
humanly generated utterances.

The phonetic structure of sound, as applied to the English language,
is broken up into two general classes, vowels and consonants. In relation
to speech utterances, these classes associate themselves in a manner
such that vowels correspond to voiced sounds and consonants correspond
to unvoiced sounds. However, not all consonants are unvoiced. There
are some consonants that correspond to voiced sounds as well as unvoiced
ones. These sounds are called combination sounds and are treated as such.

The voiced speech sound, as found from basic speech studies, is known
to consist of many formants. However, if the frequency content of speech
is limited to below 3.5 kHz, only three formants are known to exist.

These three formants, along with pitch and unvoiced sound, are
known to be sufficient for providing a fairly accurate replica of speech.
It is from these formants that the SEF technique was discovered.

From an extensive study (1) of vowel sound perception and the three
speech formants, it was learned that an equivalence existed for the three
formants. The idea stemmed from the fact that when one hears, he does
not distinguish between individual formants. Instead, the ear seeks to
hear an equivalence. The concept is the same as that of two colors in
hue. As two different colors approach each other at the point of incidence,
the eye sees neither of the original colors. Instead, a third color
unlike the other two is seen. With this in mind, the graph of Figure 3
provides an approximation of this occurrence with reference to vowel
sound perception and speech formants.

The concept of the single equivalent formant was formed from these
ideas. As shown in Figure 3, the three formants are described as functions
of vowels and frequency. The heavy black curve represents the single
equivalent formant (SEF). As indicated, for back vowels the first formant
dominates; for central vowels, an average of the first and second formants
dominates; and finally, for the front vowels, the second formant dominates.
Thus, the SEF appears to shift, depending on which vowel or group of
vowels occurs in the spoken text.

Although indicated on the graph of Figure 3, the heavy black curve
does not exist physically in speech. Hence, the concept lies in being
able to recognize the dominant formant from real time speech. By being
able to do this, and realizing that the change from one dominant formant
to another is never greater than at a 25 Hz rate, a low bandwidth
processor can be constructed. Theoretically, this could be done by
describing the SEF in frequency and amplitude, along with pitch and unvoiced
sounds. Each of the four parameters can be made to vary at a 20 Hz rate.

A development was undertaken to implement this theory and concept.
A discussion of that effort follows. As will be seen, direct implementation
of this concept was not achieved. However, the existing hardware is
a direct result of the effort to do so.

II. SOME ASPECTS AS TO HOW THE PROCESSOR WAS DEVELOPED

The development put forth to implement the SEF concept was the
result of funds expended by MSC in a program outlined by contract
NAS 9-^523. The purpose of the program was to develop a laboratory narrow
bandwidth speech processor for spacecraft applications. The characteristics
of the processor were to be:

(1) 160 Hz communication channel bandwidth

(2) Able to meet, as a design goal, 80% phonetically balanced (PB)
word intelligibility for a signal-to-noise ratio of 20 dB

(3) Capable of meeting, as a design goal, an analyzer volume of 35 in.³

(4) Capable of being put into microcircuit form

(5) Able to have a power consumption of no more than 5 watts

The SEF technique appeared to offer great promise of meeting these
requirements.

The program was initially focused on utilizing the SEF concept directly. 

An analyzer, built from earlier work, existed which provided the extracted 
SEF parameters (SEF period, SEF amplitude, pitch and voiced/unvoiced 
decision). Thus, work was concentrated on developing a synthesizer that 
would provide an intelligible speech replica from the given parameters. 

Three attempts were made to design a suitable synthesizer. The first
involved use of a pitch oscillator to cohere a square wave single equivalent
formant oscillator. The second was construction of a three formant
synthesizer. The third was construction of a three formant synthesizer with
digital formant shapers. None of these was able to provide a suitable speech
replica. However, it was discovered later that more accurate definition of
the SEF was necessary from the analyzer. This required designing a new
analyzer.

The new analyzer was defined as a SEF tracking filter analyzer.
It appeared to work well. However, suitable definition of all vowels
was not possible. Either the A to i, or u, V, region could be defined,
but not both simultaneously.

A two formant analyzer of the same type was used to overcome this
problem. The u, V, vowel region was defined by the first formant
and the A to i region was defined by the second formant. The entire
vowel range could now be defined.

Because the analyzer was now of the two formant type, the synthesizer
also was designed to be of the two formant type. Thus, the final system
is a two formant speech processor.

The final system does not exhibit direct implementation of the SEF
technique; however, the concept remains the same. Also, the final system
meets all of the intended requirements.

A simplified analysis of how this processor operates is provided in
the next section.




Fig. 5 Synthesizer Block Diagram

III. SIMPLIFIED ANALYSIS OF HOW THE PROCESSOR OPERATES


The narrow bandwidth speech processor developed is a two formant
processor and operates functionally in accordance with the diagrams of
Figures 4 and 5. Figure 4 is a block diagram of the processor analyzer.
Figure 5 is a block diagram of the processor synthesizer.


Fig. 4 Two-Formant Analyzer


The analyzer extracts directly from the speech signal the parameters that
provide the low bandwidth. The synthesizer accepts these low band limited
parameters and from them constructs a two formant replica of the speech
input. The following detailed analysis, using waveforms, will provide
some idea as to how this is done and what the final speech replica looks
like.



A signal similar to that of Figure 6 is chosen as the input for
the analysis. Like that of Figure 6, this signal represents the
word "shoe." The word shoe was chosen because it contains all of the
basic structures of speech pertinent to the operation of the processor.
Also, the time span of the word when spoken normally is within the
limitations of this analysis. The word is also a good representation
of a typical speech input.

As shown in Figure 6, the word shoe contains two basic speech sounds,
which are unvoiced and voiced. The unvoiced appears as random noise and
the voiced as a periodic wavetrain. These two properties are fundamental
in the processor operation. The periodic wavetrain contains the formant
structure and the noise like wave contains factors pertinent to the
intelligibility of the spoken text. Thus, both properties must be processed
in such a manner that they exist in the speech replica.

The task of performing this processing may be understood from
the explanations and series of diagrams to follow. The series of diagrams
is an illustration of the processing of the word "shoe." The diagrams
show how the word is broken down into five individual parameters,
multiplexed, made to vary at a 160 Hz rate, demultiplexed and logically
formed to provide a replica of the word at the processor synthesizer
output.

The five parameter breakdown of the word is (1) first formant period
F1, (2) first formant amplitude A1, (3) second formant period F2,
(4) second formant amplitude A2, and (5) pitch/unvoiced decision P/V.
In the actual processor, extraction of these parameters is done in a
simultaneous manner. However, the diagrams here follow a sequence showing,
in the analyzer, F1 extraction, A1 extraction, F2 extraction, A2 extraction
and P/V extraction. In the synthesizer, P/V reconstruction is presented
first, followed by combining P/V with F1 and A1 to form the first formant,
and combining P/V with F2 and A2 to form the second formant. Finally,
the two formants are summed to form the output speech replica.


A. ANALYZER ANALYSIS

The analyzer block diagram of Figure 4 is repeated in Figure 7 to
provide reference for the analyzer analysis. As shown in this diagram, the
speech is fed into the processor at the point indicated as the speech
input. The speech then is passed through a 200 Hz to 750 Hz bandpass
filter, a 750 Hz to 3.5 kHz bandpass filter, a pitch detector and a
voicing detector.
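
The band split described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the report does not give the order or design of the breadboard filters, so a generic second-order digital band-pass section (the widely used "cookbook" biquad form) is assumed here, along with a 16 kHz sample rate. All function names are hypothetical.

```python
import math

def bandpass_coeffs(f_low, f_high, fs):
    """Second-order band-pass biquad with unity gain at the center frequency.

    Assumed design; the breadboard's actual filter order is not stated.
    """
    f0 = math.sqrt(f_low * f_high)          # geometric center frequency
    q = f0 / (f_high - f_low)               # quality factor from the band edges
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def biquad(samples, b, a):
    """Apply the filter as a direct-form I difference equation."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out

def steady_state_gain(freq_hz, b, a, fs, n=4000, skip=1000):
    """RMS gain for a sine input, ignoring the start-up transient."""
    tone = [math.sin(2.0 * math.pi * freq_hz * k / fs) for k in range(n)]
    out = biquad(tone, b, a)
    rms = lambda v: math.sqrt(sum(s * s for s in v) / len(v))
    return rms(out[skip:]) / rms(tone[skip:])

fs = 16000.0                                             # assumed sample rate
first_band = bandpass_coeffs(200.0, 750.0, fs)           # first formant region
second_band = bandpass_coeffs(750.0, 3500.0, fs)         # second formant region
```

With these assumed filters, a 400 Hz tone passes the first band almost unchanged while a 2000 Hz tone is attenuated, and the reverse holds for the second band.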

The route through the 200 to 750 Hz bandpass filter represents
isolation of the first formant and is the first series of diagrams to be
discussed. In Figure 8, a block diagram shows first formant isolation
and F1, A1 extraction. From Figure 9, it is seen that the speech input
is passed through a 200 Hz to 750 Hz bandpass filter. The output of this
filter, as shown, is the variation of the first formant in both period
and amplitude. This output, as illustrated in Figure 10, is first fed
to the first formant period detector. Here, period averaging of each
damped wave is performed and the output is the variation in the first
formant period F1. However, as shown in Figure 11, this output is passed
through a 20 Hz low pass filter. This filter is designed such that its
output voltage is a function of the relative change in formant period.
This results in the final form for F1. The theory that makes this
approach feasible is that of recognizing that the rate of change of
formant frequency in human speech is less than 20 Hz. This can be seen
by analyzing the sonagraphs provided in the appendix of this report.
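
The period extraction just described can be sketched in Python. This is a minimal illustration and not the breadboard circuit: it assumes a sampled signal, estimates the period from the spacing of positive-going zero crossings, and smooths the result with a one-pole low-pass filter standing in for the 20 Hz low pass filter. The function names, the 8 kHz sample rate and the filter form are all assumptions.

```python
import math

def zero_crossing_periods(samples, fs):
    """Time in seconds between successive positive-going zero crossings."""
    crossings = []
    for n in range(1, len(samples)):
        if samples[n - 1] < 0.0 <= samples[n]:
            crossings.append(n)
    return [(b - a) / fs for a, b in zip(crossings, crossings[1:])]

def one_pole_lowpass(values, cutoff_hz, rate_hz):
    """Smooth a sequence with a one-pole low-pass filter (assumed form)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / rate_hz)
    out, state = [], values[0]
    for v in values:
        state += alpha * (v - state)
        out.append(state)
    return out

fs = 8000.0                                   # assumed sample rate
# A 500 Hz tone stands in for the band-limited first formant.
tone = [math.sin(2.0 * math.pi * 500.0 * n / fs + 0.3) for n in range(800)]
periods = zero_crossing_periods(tone, fs)     # one estimate per cycle
# One period estimate arrives per cycle, i.e. 500 estimates per second here.
smoothed = one_pole_lowpass(periods, cutoff_hz=20.0, rate_hz=500.0)
```

For the steady 500 Hz tone every estimate is 2 ms, so the smoothed track is constant; with real speech the smoothing would limit the parameter to slow variations.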

The amplitude parameter associated with the first formant, as
shown in Figure 12, is obtained by passing the bandpass filtered speech
into an amplitude detector. The amplitude detector is a fullwave envelope
type detector. Thus, the output represents the peak-to-peak envelope of
the input. This output, as indicated in Figure 13, is passed through a
20 Hz low pass filter. The filter output is the relative change of the
formant peak amplitude. A theory similar to that discussed for the
formant period is found to be true of the amplitude peaks and is applied
in this process.
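
A full-wave envelope detector of the kind described above can be approximated in a few lines of Python. The sketch below is an assumption about form only: it rectifies the signal and follows its peaks with an exponential release. The release constant and the test tone are hypothetical, and the subsequent 20 Hz smoothing is omitted.

```python
import math

def fullwave_envelope(samples, release=0.999):
    """Full-wave rectify, then follow peaks with an exponential release."""
    env, out = 0.0, []
    for x in samples:
        rectified = abs(x)                    # full-wave rectification
        env = max(rectified, env * release)   # hold peaks, decay slowly between them
        out.append(env)
    return out

fs = 8000.0                                   # assumed sample rate
# A 500 Hz tone of amplitude 0.5 stands in for the filtered first formant.
burst = [0.5 * math.sin(2.0 * math.pi * 500.0 * n / fs) for n in range(400)]
envelope = fullwave_envelope(burst)
```

After the first peak the envelope stays close to the tone amplitude of 0.5, which is the slowly varying quantity the amplitude parameter is meant to carry.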

The second formant parameter extraction process is exactly the same
as that for the first formant and is illustrated in Figure 14 through
Figure 19. However, because the second formant region occurs between
750 Hz and 3.5 kHz, the period appears shorter than that of the first
formant. The relative change in period and peak amplitudes remains in the
same order as those of formant one. As a result of this operation, the third
and fourth parameters are defined.

The final extraction process, as illustrated in Figure 20 through
Figure 24, is that of pitch/unvoiced decision. As seen in Figure 20,
the speech input is fed simultaneously to a pitch detector and a voicing
detector.

The pitch detector analyzes the periodic part of the speech signal
and determines in the time domain the beginning of each damped wave.
As a result, a series of pulses is produced as the pitch detector output.
Each pulse corresponds to the beginning of a pitch period. The period
is then the time between successive pulses. After production, these
pulses are passed through a 20 Hz low pass filter. The output voltage
of the filter is then a signal proportional to the period.

The voicing detector, like the pitch detector, analyzes the speech
signal. However, it is designed to provide a bi-level output signal
for all zero crossings of the speech signal above 2200 Hz. Thus, the
occurrence of unvoiced sounds is determined. This is possible
because the highest significant voiced sound zero crossing occurs at
or below 2200 Hz.
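
The zero-crossing test above lends itself to a small Python sketch. It is an illustration only: it assumes the decision is made on short frames of sampled speech, converts the zero-crossing count of a frame to an equivalent frequency, and compares that with the 2200 Hz threshold. The frame length, 16 kHz sample rate and function names are assumptions, and seeded random noise stands in for an unvoiced sound.

```python
import math
import random

def zero_crossing_frequency(samples, fs):
    """Equivalent frequency implied by the zero-crossing count of a frame."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0.0)
    duration = len(samples) / fs
    return crossings / (2.0 * duration)       # two crossings per cycle

def is_unvoiced(samples, fs, threshold_hz=2200.0):
    """True (the +v level) when the zero-crossing rate exceeds the threshold."""
    return zero_crossing_frequency(samples, fs) > threshold_hz

fs = 16000.0                                  # assumed sample rate
rng = random.Random(0)                        # fixed seed for repeatability
vowel_like = [math.sin(2.0 * math.pi * 300.0 * n / fs + 0.1) for n in range(1600)]
hiss_like = [rng.gauss(0.0, 1.0) for _ in range(1600)]
```

Here `is_unvoiced(vowel_like, fs)` is false and `is_unvoiced(hiss_like, fs)` is true, mirroring the -v and +v levels of the bi-level output.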

The output of the voicing detector, along with the pitch detector
output, is used to drive an analog gate. Here, the two are combined
to form the pitch/unvoiced decision parameter.

Figure 21 shows the pitch detector input and output. This output 
is proportional to the input occurrence of the pitch pulses. 

The output of the voicing detector, as shown in Figure 23, is
a bi-level signal which says that when the level is +v, the speech
input is unvoiced, and when the level is -v, the speech input is
voiced. Figure 24 shows the addition of the unvoiced
decision and pitch. These two parameters can be combined as one because
in speech they never occur simultaneously. The speech is either voiced
or unvoiced, but not both.

The combined parameterized speech output of the processor analyzer
is shown in Figure 25. These parameters are individually fed to the
multiplexer, shown in Figure 26. The multiplexer operates on these
parameters and combines them in a process similar to that described in
Figure 27 through Figure 30. The diagram of Figure 30 represents the
composite analyzer output. However, this output is passed through a 160 Hz
low pass filter and the final compressed speech output of the analyzer is
shown in Figure 31.

Figure 32 relates this compressed speech output to the speech input.
As can be seen, the two signals are different.

Fig. 7 Two-Formant Analyzer

B. SYNTHESIZER ANALYSIS


The synthesizer or reconstruction processor is somewhat the
reverse of the analyzer operation. The compressed signal is fed first
to the demultiplexer via a 160 Hz low pass filter. The demultiplexed
output is passed through a bank of 20 Hz low pass filters, thus
recovering the five speech parameters. A block diagram of the complete
synthesis process is shown in Figure 33. The demultiplexing operation
is shown in Figure 34 through Figure 38. Figure 34 shows a block
diagram of the demultiplexer. Figure 35 shows the multiplexed analyzer
output as the input to the demultiplexer. The demultiplexer output is
shown in Figure 36. This output is fed through the bank of 20 Hz low
pass filters shown in Figure 37. The outputs of these filters, shown
in Figure 38, are the recovered speech parameters. By comparing
Figure 38 to Figure 25, a degree of degradation to the speech parameters
due to the multiplexing process can be realized. However, it must be
kept in mind that the idea is somewhat oversimplified.
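
The multiplex and demultiplex round trip can be sketched in Python. The report's own multiplexing scheme is shown only in its figures, so the sketch assumes the simplest possibility, a round-robin interleaving of the five parameter tracks. The track values and function names are hypothetical, and the 160 Hz and 20 Hz low pass filters are left out, so the degradation mentioned above does not appear.

```python
def multiplex(channels):
    """Interleave equal-length parameter tracks sample by sample."""
    return [value for frame in zip(*channels) for value in frame]

def demultiplex(stream, n_channels):
    """Undo the interleaving: every n-th sample belongs to one channel."""
    return [stream[i::n_channels] for i in range(n_channels)]

# Hypothetical parameter tracks in the order F1, A1, F2, A2, P/V.
tracks = [
    [2.0, 2.1, 2.2],
    [0.5, 0.6, 0.4],
    [0.7, 0.7, 0.8],
    [0.2, 0.3, 0.2],
    [-1.0, -1.0, 1.0],
]
stream = multiplex(tracks)                    # single composite sequence
recovered = demultiplex(stream, len(tracks))  # five tracks again
```

In this idealized form `recovered` equals `tracks` exactly; any real channel filtering between the two steps would smear adjacent samples and produce the degradation seen when comparing Figure 38 with Figure 25.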

The speech parameters of Figure 38 are the ones used to reconstruct
a replica of the original speech input. As noted in Figure 38, there
are five individual parameters: (1) F1, (2) A1, (3) F2, (4) A2 and (5) P/V.
The explanation and series of diagrams to follow, Figures 39 through 49,
will show how a speech replica is formed from these five parameters.



Considering parameter five first, and as illustrated in Figure 39,
the Pitch/Unvoiced (P/V) decision is fed to a threshold detector and
an inverting amplifier of unity gain. The output of the threshold
detector, as shown in Figure 40, is the unvoiced decision and the output
of the inverting amplifier, also shown in Figure 40, is the pitch. It
is obvious why this is true.
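
One way to see why the threshold detector and the inverting amplifier separate the two quantities is the following Python sketch. It rests on an assumed encoding that the report does not spell out: the combined P/V signal sits at a positive level +v when the speech is unvoiced and carries the pitch voltage with negative polarity when the speech is voiced. The function names and voltage values are hypothetical.

```python
def encode_pv(pitch_voltage, unvoiced, v_level=1.0):
    """Assumed encoding: +v_level when unvoiced, negated pitch voltage when voiced."""
    return v_level if unvoiced else -abs(pitch_voltage)

def decode_pv(pv, threshold=0.0):
    """Threshold detector gives the unvoiced decision; inversion recovers pitch."""
    unvoiced = pv > threshold                 # positive level means unvoiced
    # Unity-gain inversion; gated to zero here while the speech is unvoiced.
    pitch_voltage = 0.0 if unvoiced else -pv
    return unvoiced, pitch_voltage

voiced_case = decode_pv(encode_pv(0.4, unvoiced=False))
unvoiced_case = decode_pv(encode_pv(0.4, unvoiced=True))
```

Under this assumption `voiced_case` is `(False, 0.4)` and `unvoiced_case` is `(True, 0.0)`: because the two conditions never occur together, one signal of two polarities can carry both.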

Both of these outputs are fed to a pitch generator and noise
source. The output of the pitch generator is shown in Figure 41. As
illustrated, this output contains random noise and periodic pulses.
The random noise represents unvoiced sound and the periodic pulses
represent speech pitch. Both of these signals in their series occurrence are
fed to two switch modulators along with first formant amplitude and second
formant amplitude, respectively. As a result of this operation, first and
second formant reconstruction begins.
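
The excitation signal described above can be sketched in Python. This is a simplified stand-in for the pitch generator, noise source and switch modulator taken together: it emits amplitude-scaled random noise while the unvoiced decision is set and amplitude-scaled pitch pulses otherwise. The sample-based formulation, the seeded noise and every name and value are assumptions.

```python
import random

def excitation(unvoiced_flags, pitch_period_samples, amplitude, seed=0):
    """Noise when unvoiced, amplitude-scaled pitch pulses when voiced."""
    rng = random.Random(seed)
    out, countdown = [], 0
    for unvoiced, period, amp in zip(unvoiced_flags, pitch_period_samples, amplitude):
        if unvoiced:
            out.append(amp * rng.uniform(-1.0, 1.0))
            countdown = 0                     # restart pulse timing at the next voiced sample
        elif countdown <= 0:
            out.append(amp)                   # emit a pitch pulse
            countdown = period - 1
        else:
            out.append(0.0)                   # silence between pulses
            countdown -= 1
    return out

n_noise, n_voiced = 40, 240
flags = [True] * n_noise + [False] * n_voiced
periods = [80] * (n_noise + n_voiced)         # hypothetical 100 Hz pitch at 8 kHz sampling
amps = [0.5] * (n_noise + n_voiced)
signal = excitation(flags, periods, amps)
```

The result is a noise burst followed by pulses at samples 40, 120 and 200, loosely analogous to the unvoiced and voiced portions of the word "shoe."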

Formant reconstruction is illustrated in the diagrams of Figures 42
through 47. The first formant is shown in Figures 42 through 44, and the
second formant in Figures 45 through 47. Figure 42 represents a block
diagram of the first formant reconstruction process. The first formant
amplitude parameter, the unvoiced decision and pitch are fed to the inputs
of switch modulator number one, as shown in Figure 43. The output of
switch modulator one, also shown in Figure 43, is a composite of unvoiced
sound and amplitude modulated pitch pulses. This output is then used to
drive the first formant bandpass active voltage tuned filter (AVTF) and
tracking low pass AVTF along with the first formant period parameter.



The output of these two AVTF's, as shown in Figure 44, is unvoiced
sound plus the first formant. The theory of reconstructing this formant
lies in the fact that the modulated pulses from switch modulator one
cause the bandpass AVTF to ring and produce periodic damped sinusoids.
At the same time, the ringing frequency is made to vary as a function
of the first formant period parameter F1. Thus, as a result of these
operations, the first formant is reconstructed.
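
The ringing behavior can be illustrated with a two-pole digital resonator in Python. The sketch is a software analogy and not a model of the AVTF circuit: the resonator form, the damping factor `r`, the 8 kHz sample rate and the names are assumptions. The resonant frequency is set sample by sample from the formant period parameter, so each input pulse produces a damped sinusoid at that frequency.

```python
import math

def ringing_filter(excitation, formant_period_s, fs, r=0.97):
    """Two-pole resonator whose ringing frequency follows 1 / formant period."""
    y1 = y2 = 0.0
    out = []
    for x, period in zip(excitation, formant_period_s):
        w = 2.0 * math.pi / (period * fs)     # resonant frequency in radians per sample
        y = x + 2.0 * r * math.cos(w) * y1 - r * r * y2
        y2, y1 = y1, y
        out.append(y)
    return out

fs = 8000.0                                   # assumed sample rate
pulse = [1.0] + [0.0] * 63                    # one pitch pulse
f1_period = [1.0 / 500.0] * len(pulse)        # hypothetical constant 2 ms formant period
ring = ringing_filter(pulse, f1_period, fs)
```

The response swings positive and negative with a 2 ms period while decaying by the factor `r` on every sample, which is the damped sinusoid the text describes. Feeding a varying period track would shift the ringing frequency accordingly.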

The second formant is reconstructed in the same manner as the first
and is illustrated in Figure 45 through Figure 47. However, after being
reconstructed, it is combined with the first formant in a summing
process.

This summing process is performed in a summing amplifier as shown
in Figure 48. The summed output, shown in Figure 49, is the replica of
the speech input. As illustrated, the signal contains noise representing
unvoiced sound and a periodic wavetrain representing voiced sound,
corresponding to the SH and OE in the word "shoe."

Figure 50 gives a summary illustration of the synthesis process. 

As shown, the input is the multiplexed output of the analyzer and the 
processor synthesizer output is a replica of the analyzer speech input. 

Finally, Figure 51 is a general summary of the complete processor
operation. The word "shoe" is provided as the analyzer input. It is
then processed, made to vary at a 160 Hz rate and used as the synthesizer
input. The synthesizer expands the compressed word and presents a replica
of the input word shoe as its output. Thus, as a result of this, the
processing operation is complete.



This concludes the pictorial analysis of the narrow bandwidth
speech processor, and it is hoped that some understanding of the system
operation has been achieved. The following section is a discussion
of applications of the processor.

IV. APPLICATIONS


Studies in speech processing were initiated to find a means of
lessening power requirements for voice communications from spacecraft
to earth. In the course of performing these studies, the general
consensus was that if the bandwidth could be reduced, then transmission
power, being a function of bandwidth, could also be reduced. Under this
consensus, the idea for the narrow bandwidth processor was conceived.

Thus, it was hoped that such a processor could some day be used as
spacecraft flight hardware. To date, and in a general sense, this
seems possible for the developed processor in two areas: (1) narrow
bandwidth, and (2) size-weight requirements. However, there remain factors
that still must be carefully considered before some definite application
commitment can be made. In particular, the processor itself is presently
limited to laboratory tests and evaluations. Reasons for this lie
within the initial development intent, which was to demonstrate that
human speech can be transmitted within a 160 Hz bandwidth with an
acceptable degree of intelligibility. No attempt was made to optimize
intelligibility or to include voice quality. Also, no attempt was made
to eliminate a listener learning requirement that became obvious as a
result of the development effort. Thus, the output voice replica of
the processor is presently an unhumanlike sound that requires an
adjustment period for optimum appreciation.

It should be noted that a significant achievement has been made
with the development of this processor. It is a first and does exhibit
advancement to the state-of-the-art. Before, no speech processor existed
that reduced the speech bandwidth to 160 Hz and provided acceptable
intelligibility. However, for this system to be useful for the intended
application, improvements are necessary.

The areas of particular interest are optimum intelligibility, voice
quality and elimination of the dependency on listener learning. It is
very possible that from extensive studies of the existing breadboard,
improvements in these areas can be defined along with establishing
techniques for implementation.

In conclusion, it is felt that the concept used to develop the
narrow bandwidth speech processor is quite an advancement to the
state-of-the-art and can be developed to possibly achieve the intended
application. However, continued development should make this a fact.